Integrated document viewer

ABSTRACT

In various embodiments of the present invention, documents (e.g., PDFs) are converted into HTML 5 (and CSS 3) formats and integrated into existing HTML 5 web pages in a manner that preserves the original embedded fonts. The converted documents can also be embedded (e.g., via the standard HTML “iframe” tag) into other web pages. The original appearance of the source document is maintained, the text is preserved as searchable text, and the document is integrated into a web page that can be searched, zoomed, scrolled, and printed utilizing standard web browser controls. A significantly increased “ad inventory” is thereby enabled, wherein advertisements can be integrated between pages, or even within a page. Moreover, the resulting document can be passively shared with members of a user's external social networks (as well as any social network within the host website), along with other activities and behaviors performed by the user on the hosting website.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a Divisional of U.S. patent application Ser. No. 13/189,372, filed Jul. 22, 2011, which is a Divisional of U.S. patent application Ser. No. 12/912,625, filed Oct. 26, 2010, both titled “Integrated Document Viewer With Automatic Sharing of Reading-Related Activities Across External Social Networks,” which claims the benefit (pursuant to 35 U.S.C. §119(e)) of (i) U.S. Provisional Patent Application No. 61/326,166, filed Apr. 20, 2010, entitled “Integrated Document Viewer with Automatic Sharing of Reading-Related Activities Across External Social Networks,” and (ii) U.S. Provisional Patent Application No. 61/330,161, filed Apr. 30, 2010, entitled “Integrated Document Viewer with Automatic Sharing of Reading-Related Activities Across External Social Networks with Additions.” The entire disclosures of all of these applications are expressly incorporated herein by reference.

I. BACKGROUND

A. Field of Art

This application relates generally to the integration of documents into web pages, and in particular to systems and techniques for preserving a document's original nature and appearance when displaying the document within the pages of a website, and automatically sharing users' reading-related activities on that website across their external social networks.

B. Description of Related Art

Well before the advent of the Internet and the World Wide Web, software developers struggled to display documents on a computer monitor in the form intended by the authors of such documents. Initially, documents displayed on a computer screen were limited to text, with little or no choice of fonts, much less page layout and formatting of any kind. As word processors and other presentation programs evolved, fonts were integrated and other media were added (such as images, animation and even video), along with page layout features for presenting the various components of a document with a particular appearance desired by the document's author. Moreover, documents themselves have evolved well beyond traditional text, to include various different static and interactive media and page layout attributes, and to appear in many different forms, ranging from short emails or blog posts to book previews, news articles and creative writing samples, to long novels or reference books, and almost anything in between.

As the Web gained traction in the early to mid 1990s, an entirely new medium for presenting and distributing documents evolved, and a new type of document was created, namely, the “web page” within a “website” containing a collection of related (and often linked) web pages. This new type of document, employing a document format known as “Hypertext Markup Language” (HTML), also went through a similar evolution to that of traditional documents, initially being limited to text, and soon adding other media, including images, animation, and video, as well as hyperlinks, buttons and various other interactive objects and functionality.

Whether an author initially creates a document as a web page (typically displayed via a program known as a “web browser”) or as a more traditional page-oriented document (i.e., a document that is inherently divided into pages corresponding to static “printable” pages), the author intends for the document to be printed or displayed on a computer monitor with a particular desired appearance. A document's appearance includes a variety of presentation and page layout characteristics, such as the position, size and orientation of various component text, graphic and other static and interactive objects on each page of the document. It should be noted that the nature or functionality of these object types also is generally intended to be preserved, particularly when displayed on a computer monitor.

Of particular importance, however, are the various fonts associated with specific text, which themselves have various attributes, including font type, size, style, etc. Given that most documents consist primarily of text, it is not surprising that the particular fonts employed within a document play a significant role in the document's overall appearance.

Maintaining a document's appearance as it is distributed among different computers and platforms (including its appearance when printed or displayed within a web page) has long been a problem addressed by various software technologies. For example, if a document is created with a particular word processing program and transferred to another computer which does not have access to that program, then the document may not even be accessible on the destination computer, or may only be accessible via another program that displays the document with a modified appearance (e.g., with different fonts or other formatting attributes).

One of the leading solutions to this problem, even pre-dating the Web, is the “portable document format” (PDF) created by Adobe Systems, Inc. The PDF is designed to preserve fonts, as well as page layout and other object and document formatting characteristics, so that documents retain a virtually identical appearance when distributed across computers and platforms, displayed on a computer monitor or printed onto a physical medium, such as paper. For this reason, the PDF has become a widely adopted standard document format for printing and distributing documents across computers and platforms, regardless of which program the document's author used to create the document.

At this point, it is virtually impossible to distinguish the appearance of a document created as a web page (HTML) from that of one created as a more traditional page-oriented document via a word processing, presentation or page layout program. Both can contain various media types, from static text and graphics to animation, video and other interactive objects and functionality, such as hyperlinks, buttons and other controls. Moreover, both can be printed as static pages on physical paper, even though HTML documents are not generally divided into distinct pages unless and until they are printed. Finally, both can be converted into PDF documents so as to retain their intended appearance when printed or distributed among different computers and platforms.

Even PDF documents, however, have been difficult to integrate into web pages, while preserving their intended appearance, due to historical formatting limitations of the HTML format, which traditionally has allowed for the display of only a limited number of fonts. For example, Adobe and others have created programs that display existing PDF documents within a web browser's window. Yet, these programs cause the document to occupy the entire web browser window (along with the controls typically associated with Adobe's “Acrobat” program for displaying PDF documents). In other words, although the PDF document may appear within a web browser's window, it is not truly integrated into another web page; instead it becomes a distinct “web page” of its own. Thus, the author of a web page cannot easily integrate an existing PDF document as part of a web page that includes other web elements or objects, such as text, images, advertisements, etc.

Other approaches to this problem include programs that use Adobe “Flash” (or other programming languages/platforms) to display a PDF document in a distinct window within a web page, preserving the appearance of the PDF document while still allowing for other components of the web page to be displayed within the same web browser window. This approach has a number of disadvantages, however, in that the PDF document is not truly integrated into the web page; instead it remains in a separately controlled window within that web page. For example, a user must scroll through the PDF document separately from the rest of the web page, resulting in the significant inconvenience of having to switch between scrolling through the PDF document and scrolling through the web page. Moreover, the “zoom” level and controls of the PDF document are distinct from those of the web page, often forcing the user to zoom the PDF document to a desired level for reading, but switch to a “global” zoom level to read the other components of the web page (text, images, ads, etc.), and then reset the zoom level of the PDF document to continue reading (often while repeatedly readjusting the scrolling positions of the PDF document and the overall web page). In short, the PDF document becomes a separately controllable object that is subservient to the primary web browser controls for the overall web page window, resulting in significant inconvenience to the user.

Other approaches include PDF-to-HTML converters that enable the integration of the PDF document into a web page containing other component elements, but do so by sacrificing the original appearance of the document. For example, they convert the fonts embedded within the PDF document into the limited number of fonts typically made available to a computer's web browser. This approach defeats the primary objective of preserving the author's intended appearance of the PDF document.

Yet another approach involves converting the PDF document into an “image” which preserves its intended appearance while allowing for other components of the web page to be displayed within the same web browser window. To the extent this approach employs a separately scrollable window, it suffers from the same disadvantages as noted above. Even if the image of the entire document is truly integrated into a discrete area of the web page (as opposed to a separate scrollable “sub-window”), this approach, while preserving the appearance of text, does not preserve the nature of the text itself. In other words, the ability to search and recognize the text is sacrificed, which results in a significant loss of functionality. Not only are users unable to search through the PDF document, but other programs cannot search through and identify words and phrases within the PDF document, a critical feature for targeted advertising engines.

Google has adopted a variation of this approach with its “Google PDF viewer,” which is integrated into its “Gmail,” “Google Docs” and other programs. While each page of a PDF document is still converted into an “image” under this approach, users can search for individual words within the document by virtue of Google's “thin client” approach, which relies upon frequent interaction between the user's web browser and a remote web server.

For example, upon detecting that the user has attempted to select a word by clicking on the portion of the image containing that word, the user's web browser invokes the remote web server, which must parse the page of the PDF document to identify the “text” version of that word (e.g., the individual ASCII characters of the word), which can then be sent to the user's web browser, for example, to highlight the word or permit it to be copied and pasted elsewhere. Moreover, a user can search for words within the document by typing them into the user's web browser, which again must invoke the remote web server to conduct the search on the “text” within the PDF document, and then return the results to the user's web browser.

Yet, this “thin client” approach suffers from a number of disadvantages that result from converting the PDF document into an “image” rather than directly into text (along with the fonts that determine the appearance of that text). For example, the “image” of each page of the document is significantly larger than the corresponding text on that page (even apart from other non-text elements on the page), resulting in an additional delay before each page of the document can be delivered to and displayed by the user's web browser.

Moreover, the frequent server interaction imposes further delays whenever the user interacts with the document, e.g., by scrolling to a new page or selecting or searching for words within the document. Even though the “image” of each page can be “zoomed” with the user's standard web browser controls, the words of the document become distorted when zoomed (as would any bitmapped image of text), causing Google to include a custom “zoom” control to avoid this distortion, but at the expense of further delay due to additional server interaction.

In short, there remains a need for the true integration of PDF and other documents into a web page that preserves the original nature and appearance of the documents (including in particular the original text fonts and the ability to search the text), allows for other components of the web page to coexist within the same web browser window, and enables users to read, interact with and control all components of the web page (including the document) via the controls built into standard web browsers.

In addition to reading a PDF or other document as an integral part of a web page, users may also desire to share their reading-related activities (e.g., viewing, annotating, rating, uploading and downloading documents) with friends or other members of their social networks. Yet, actively choosing to share an activity or behavior is burdensome. For this reason, “passive sharing” is more desirable (i.e., setting predefined sharing preferences, with future behavior resulting in the automatic sharing of such behavior in accordance with those preferences).

While passive sharing is becoming increasingly common, it has yet to be integrated into the activities or behavior within a website independent of the sharing process itself. For example, the sharing of activities and behavior on a social networking site, such as Facebook, Twitter and MySpace, is integral to the nature of these sites. Sharing messages, high scores of games played on the site and other activities is the very essence of participation in these social networks.

As these social networks have grown exponentially in popularity, even external behavior is now being “passively shared” among members of these social networks. For example, “Blippy” (a service offered via the website, www.blippy.com) enables users to share their “purchasing behavior” (i.e., purchases made anywhere via a credit card, registered at the “Blippy” website) with other members of their social networks. Yet, even Blippy is designed with sharing as an integral component. Users already purchase items with their credit cards, and they already share their activities and behavior on their social networks with other members. Blippy simply connects the two, enabling the passive sharing of this existing external behavior (shopping) with users' existing social networks (e.g., Facebook friends).

As the concept of “passive sharing” increases in popularity, there is a desire on the part of many users to enable their activities and behavior on a website (that are otherwise unrelated to their social networks) to be passively shared among their social networks (even beyond that website).

II. SUMMARY

Various embodiments of the current invention are disclosed herein, including techniques, apparatus, and systems for preserving a document's original nature and appearance when displaying the document within the pages of a website, and automatically sharing users' reading-related activities on that website across their external social networks.

While various iterations of the HTML format have included over time a feature allowing for the downloading of custom fonts (“web fonts”) that can be embedded into web pages, web fonts have been employed to enhance the authoring capabilities of HTML documents, rather than to facilitate the integration of PDF and other documents into web pages. For example, the “@font-face” tag has been a component of the “Cascading Style Sheets” (CSS) specification for a number of years. Most recently, the HTML 5.0 specification, which relies upon CSS 3 (which includes the @font-face tag), has been (or soon will be) implemented in most major web browsers (e.g., Firefox, Safari, Internet Explorer, etc.).

In one embodiment of the present invention, the @font-face tag is employed in connection with the conversion of a PDF document into HTML to ensure the preservation of the original fonts embedded within that document. These fonts are downloaded and employed to generate the resulting HTML 5 document, which can then be integrated into any desired web page, as well as embedded into other web pages (e.g., by using the standard HTML “iframe” tag). In this manner, the original appearance of the source document (PDF, in this embodiment) is maintained, the text is preserved as searchable text, and the document is integrated into a web page that can be searched, zoomed, scrolled, printed, etc., utilizing standard web browser controls.
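As a rough illustration of the idea (not the conversion code of any embodiment), the following Python sketch emits a CSS @font-face rule for a font extracted from a source document and then styles a run of ordinary, searchable text with it. The function names, the font family name `DocFont1`, the file path and the `woff` format are all hypothetical choices made for this example.

```python
import html


def font_face_rule(family: str, url: str, fmt: str = "woff") -> str:
    """Return a CSS @font-face rule registering one downloadable font.

    `family`, `url` and `fmt` are illustrative; a real converter would
    derive them from the fonts it extracts from the source document.
    """
    return (
        "@font-face {\n"
        f'  font-family: "{family}";\n'
        f'  src: url("{url}") format("{fmt}");\n'
        "}"
    )


def styled_text(text: str, family: str, size_px: int) -> str:
    """Wrap text in a span using the registered font.

    The text is emitted as escaped characters, not as an image, so the
    browser can still search, select and zoom it.
    """
    style = f"font-family: '{family}'; font-size: {size_px}px"
    return f'<span style="{style}">{html.escape(text)}</span>'


if __name__ == "__main__":
    print(font_face_rule("DocFont1", "fonts/docfont1.woff"))
    print(styled_text("Chapter 1 < Introduction >", "DocFont1", 14))
```

Because the text is written out as characters rather than pixels, the browser's own find, zoom and print functions apply to it without any document-specific controls.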

Moreover, because the PDF document is now an integral component of the resulting HTML 5 web page, a significantly increased “ad inventory” is enabled. Advertisements can be integrated between the individual pages (or even within a page) of the document. Even in the context of a relatively short 20-page document, there is at least a 20-fold increase in the ad inventory over what would be present if the document were confined to a separately scrolled window within the web browser's window.

In addition, the resulting document (independent of its format) can be passively shared with desired members of a reader's external social networks (as well as any social network within the host website), along with other reading-related activities and behavior performed by the reader on the website hosting the document. In one embodiment, a user sets predefined sharing preferences identifying particular social networks (e.g., Twitter, Facebook, MySpace, and the host website's social network) as well as specific activities and behavior on the website to be shared on those social networks (e.g., in this embodiment, which documents have been viewed, downloaded or uploaded, or even how many pages have been viewed, as well as annotations, ratings and various other behavior or extracted analytics).

It should be noted that virtually any activities and behaviors within a website can be passively shared with a user's external social networks. In one embodiment discussed in greater detail below, a user's reading-related activities within a host website are automatically shared with desired members of a user's social networks in accordance with the user's predefined sharing preferences. The user simply accesses the host website with the desire to read documents and perform other reading-related activities, with the result that such activities are automatically “passively shared” without any further action by the user.
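The preference-driven behavior described above can be pictured with a small Python sketch (Python 3.9 or later). It is only an assumption about how such preferences might be modeled: the class name, field names, activity labels and message format are hypothetical, and no real social-network API is called.

```python
from dataclasses import dataclass, field


@dataclass
class SharingPreferences:
    """Hypothetical model of a user's predefined sharing preferences."""
    networks: set[str] = field(default_factory=set)    # e.g. {"Twitter"}
    activities: set[str] = field(default_factory=set)  # e.g. {"viewed"}


def posts_for_activity(prefs: SharingPreferences, activity: str,
                       document_title: str) -> list[tuple[str, str]]:
    """Return (network, message) pairs to publish for one user action.

    An action is shared only if the user opted in to that activity type,
    and then only on the networks the user selected. No further user
    input is needed at the time of the action.
    """
    if activity not in prefs.activities:
        return []
    message = f"{activity}: {document_title}"
    return [(network, message) for network in sorted(prefs.networks)]


if __name__ == "__main__":
    prefs = SharingPreferences(networks={"Twitter", "Facebook"},
                               activities={"viewed", "rated"})
    print(posts_for_activity(prefs, "viewed", "Annual Report"))
    print(posts_for_activity(prefs, "downloaded", "Annual Report"))
```

The second call returns an empty list because "downloaded" is not among the activities this hypothetical user chose to share.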

The value of such passive sharing from a host website to members of external social networks cannot be overestimated. In addition to the communication and other “community” benefits to users and other members of their social networks, the host websites derive significant potential value from the exponential targeted referral and advertising opportunities. These benefits are described in greater detail below.

III. BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of one embodiment of the platform and key system components employed by the present invention, including user devices, host websites and key architectural components.

FIG. 2 a is a screenshot of a document converted into an HTML 5 document and integrated into a web page in one embodiment of the present invention, illustrating the preservation of fonts from the original document, as well as the integration of the document with other elements on the web page.

FIG. 2 b is a screenshot of a document converted into an HTML 5 document and integrated into a web page in one embodiment of the present invention, illustrating the preservation not only of fonts from the original document, but also the page layout of the original document across multiple pages.

FIG. 3 is a screenshot of a document converted into an HTML 5 document and integrated into a web page in one embodiment of the present invention, illustrating the preservation not only of fonts from the original document, but also searchable text displayed with its original fonts.

FIG. 4 a is a screenshot of a document converted into an HTML 5 document and integrated into a web page in one embodiment of the present invention, illustrating the insertion of advertisements between pages of the original document.

FIG. 4 b is a screenshot of a document converted into an HTML 5 document and integrated into a web page in one embodiment of the present invention, illustrating the insertion of advertisements in the “open space” within a page of the original document.

FIG. 5 is a flowchart illustrating a process of converting and integrating a document (e.g., a PDF document) into an existing HTML 5 web page in accordance with one embodiment of the present invention.

FIG. 6 is a screenshot of an initial “ReadCast” dialog box appearing next to a document displayed on a web page in one embodiment of the present invention, illustrating the initiation of the process of setting a user's “passive sharing” preferences.

FIG. 7 a is a screenshot illustrating a user's “ReadCast” settings for a set of “passive sharing” preference controls displayed on a web page in one embodiment of the present invention.

FIG. 7 b is a screenshot illustrating alternate “ReadCast” settings (to those illustrated in FIG. 7 a) for a set of “passive sharing” preference controls displayed on a web page in one embodiment of the present invention.

FIG. 8 is a screenshot illustrating a Twitter dialog box invoked when a user selects the “ReadCast” setting (in one embodiment of the present invention) to “passively share” selected activities via the user's Twitter account.

FIG. 9 is a screenshot of a “ReadCast” dialog box displayed on a web page in one embodiment of the present invention, illustrating the conclusion of the process of setting a user's “passive sharing” preferences.

FIG. 10 is a flowchart illustrating a passive sharing process in accordance with one embodiment of the present invention, including the setting of a user's ReadCasting preferences and the automatic sharing (in accordance with those preferences) of the user's actions on a host website with the user's external social networks.

IV. DETAILED DESCRIPTION OF THE CURRENT INVENTION

A. Integrated Document Viewer

In one embodiment 100 of the present invention, illustrated in FIG. 1, the Internet 110 is the platform on which a set of documents (e.g., PDF documents, not shown) is shared between a host server 120, one or more client computers 130 and various members of social networks 140, some of whom are users of client computers 130. In this embodiment, host server 120 converts the original documents into HTML (in accordance with the HTML 5.0 and CSS 3 specifications), employing the @font-face tag to download the original web fonts embedded in the documents, and integrates each document into the desired layout of a web page.

In this manner, the appearance of each document within the web page is preserved (as in the original document), including fonts and other page layout attributes. As will be illustrated below, the text remains searchable and the document can be viewed and controlled via standard web browser controls (without the need for any document-specific controls for printing, scrolling, zooming, etc.). The remainder of the web page (including areas within the document itself) may contain other web elements, including text, images, advertisements, animation, and video, as well as hyperlinks, buttons and various other static and interactive objects and functionality.

When a user of one of client computers 130 accesses (via Internet 110) one of these documents integrated within a web page of a website hosted on host server 120, the user can perform various reading-related actions on that host website with respect to that document, such as reading, annotating, rating or downloading the document (as well as uploading other documents). As will be illustrated below, the user can also set “ReadCasting” preferences which will automatically share such documents and metadata relating to such activities with desired members of the user's external social networks 140 (including the host website's own social network, if any).

FIG. 2 a illustrates a web page 200 in which one of such documents 210 is integrated, in accordance with one embodiment of the present invention. As is apparent from this screenshot, custom fonts 220 from the original document have been preserved, and the document is integrated into the web page, with additional static and interactive elements 230 included above and alongside the document (or, in other embodiments, within the document itself).

FIG. 2 b illustrates a web page 250 containing a similar document 260 that preserves not only the fonts 270 from the original document, but also the page layout 280 of the original document across multiple pages. Thus, the appearance of the original document has been preserved, and it can be scrolled along with any remaining elements (not shown) on the web page via standard web browser scroll bars 290.

FIG. 3 illustrates a web page 300 containing a similar document 310 with preservation of the appearance of the original document, including custom web fonts and various page layout attributes, and further illustrates that the text remains searchable (as opposed to mere images of the text fonts), as is evidenced by the highlighted portions 320 of the text. As noted above, not only can users search this text, which is particularly useful for longer documents, but other programs can search for text, which can then be used for various purposes, such as providing targeted advertisements relating to particular portions of text (e.g., at the level of a document, an individual page or even specific words).

Regardless of their source, advertisements can be integrated not only on portions of the web page alongside the document (e.g., outside of the area in which the document is displayed), but also within the document itself. Because a long document is not confined to a separate fixed scrollable window within a web page, but rather extends the web page itself to the full length of the document, the entire length of the document is available for associated advertisements.

FIG. 4 a illustrates advertisements 420 inserted in between pages of a document 410. In another embodiment, such advertisements could also be located alongside the document outside of the document's frame. In either case, the advertisements would remain next to the relevant portions of the document as the entire web page is scrolled up and down. Similarly, FIG. 4 b illustrates advertisements 470 inserted into the “open space” within a page of the document 460.
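The between-pages placement can be sketched in a few lines of Python. This is a minimal illustration under the assumption that pages and advertisements are already available as HTML fragments (represented here by plain strings); the function name and the round-robin choice of advertisements are hypothetical, and no ad-targeting logic is shown.

```python
from itertools import cycle


def interleave_ads(pages: list[str], ads: list[str]) -> list[str]:
    """Return the page sequence with one ad slot between adjacent pages.

    Ads are reused in round-robin order if there are fewer ads than
    gaps. With no ads, the pages are returned unchanged.
    """
    if not ads:
        return list(pages)
    ad_source = cycle(ads)
    layout = []
    for index, page in enumerate(pages):
        layout.append(page)
        # Insert an ad after every page except the last one.
        if index < len(pages) - 1:
            layout.append(next(ad_source))
    return layout


if __name__ == "__main__":
    print(interleave_ads(["page1", "page2", "page3"], ["adA", "adB"]))
```

Since every fragment is part of the same web page, each ad scrolls together with the pages on either side of it.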

One embodiment of the process of converting and integrating a document (e.g., a PDF document) into an existing HTML 5 web page is illustrated in FIG. 5. As noted above, this process 500, unlike the traditional PDF-to-HTML conversion process, not only preserves the original fonts embedded within the document (in one embodiment, using the @font-face tag), but does so in a manner that enables the document to be integrated into an existing web page, as well as embedded into other web pages (e.g., by using the standard HTML “iframe” tag). Thus, the original appearance of the source document (PDF, in this embodiment) is maintained, the text is preserved as searchable text, and the document is integrated into a web page that can be searched, zoomed, scrolled, printed, etc., utilizing standard web browser controls (thereby providing a significantly increased “ad inventory”).

In one embodiment, performance is enhanced for long documents by dynamically loading only a few pages before and after the page currently being displayed. This substantially decreases the time required to load a document initially, and to scroll from page to page. One tradeoff, however, is that current web browsers may not print a document correctly unless all of its pages are loaded. In that case, users may save a PDF version of the document, which can then be printed.
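The windowing idea can be expressed as a short Python sketch. The text says only that "a few" pages on each side are loaded, so the default window of two pages and the function name are assumptions made for illustration; pages are numbered from 1.

```python
def pages_to_load(current_page: int, total_pages: int,
                  window: int = 2) -> list[int]:
    """Return the 1-based page numbers to keep loaded around the current page.

    `window` is the number of pages loaded before and after the current
    page. The range is clamped to the bounds of the document.
    """
    first = max(1, current_page - window)
    last = min(total_pages, current_page + window)
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(pages_to_load(1, 20))    # start of the document
    print(pages_to_load(10, 20))   # middle of the document
    print(pages_to_load(20, 20))   # end of the document
```

For a 20-page document the three calls yield `[1, 2, 3]`, `[8, 9, 10, 11, 12]` and `[18, 19, 20]`, so at most five pages are held at once regardless of document length.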

Conversion process 500 begins with the input of a document (a PDF document in this embodiment) in step 510 which is to be converted and integrated into an existing HTML 5 web page and rendered on a client user's web browser. The document is parsed in two passes, the first of which (step 520) identifies various document statistics and layering information for use in the second pass (step 530). During first pass 520, the document is parsed sequentially for distinct document “assets” (e.g., text, fonts and images in this embodiment) until each such asset has been processed. Once no document assets remain to be processed, as determined in step 525, processing proceeds to second pass 530.

Otherwise, the identified asset is processed in step 527 (the manner depending upon the type of asset). For “font” assets, various statistics are collected, such as the specific characters of that font actually used in the document (to save space and network bandwidth by ignoring unused characters), as well as the size, color, orientation and number of occurrences of such characters. Of course, various different collections of statistics could be extracted in other embodiments.
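A minimal Python sketch of this kind of statistics gathering follows. It assumes, purely for illustration, that the parser yields text runs as `(font_name, text, size)` tuples; the function name and that tuple layout are hypothetical, and color and orientation are omitted for brevity.

```python
from collections import Counter, defaultdict


def collect_font_stats(runs):
    """Collect per-font usage statistics from (font_name, text, size) runs.

    For each font, records how often each character occurs and which
    sizes are used. The set of characters actually used is what a
    converter would need in order to omit unused glyphs.
    """
    stats = defaultdict(lambda: {"chars": Counter(), "sizes": set()})
    for font_name, text, size in runs:
        stats[font_name]["chars"].update(text)
        stats[font_name]["sizes"].add(size)
    return dict(stats)


if __name__ == "__main__":
    runs = [
        ("Garamond", "Hello", 12.0),
        ("Garamond", "lol", 10.0),
        ("Courier", "x = 1", 9.0),
    ]
    for name, info in collect_font_stats(runs).items():
        used = "".join(sorted(info["chars"]))
        print(name, repr(used), sorted(info["sizes"]))
```

In this example only four distinct characters of the hypothetical "Garamond" font are used, so a subset font containing just those glyphs would suffice for the document.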

Because PDF documents store fonts in a myriad of different formats (e.g., Type1, Type3, OpenType, etc.) that are not directly usable as web fonts, and because a font may be used in different places within the document with different encodings and/or transforms, the conversion process 500 uses the @font-face tag to generate a “custom” font that can be used by a web browser as if it were one of the browser's “built-in” fonts. This aspect of process 500 occurs during second pass 530 (explained in greater detail below), utilizing the statistics collected during this first pass 520.

For “text” and “image” assets, step 527 identifies and stores the page of the document on which such assets occur, as well as the location of such assets on that page. This information also will be utilized during second pass 530.

Finally, in step 529, multi-layer objects are detected, and layering and clipping information is identified and stored for use during second pass 530. Many document formats, including the PDF format, support rich document structures that include multiple layers of objects, such as blocks of text layered on top of vector graphics, which may be layered on top of other text objects that are layered on top of bitmaps, etc. In addition to this complex “z order” of objects, support for vector fills, gradient patterns, semitransparent bitmaps, clip polygons (that mask portions of layers below) and other structural document formatting features results in a complex multi-layer object hierarchy that (to conform to HTML5 standards) must be converted into a background image with some text on top. This aspect of process 500 occurs during second pass 530 (explained in greater detail below), utilizing the layering and clipping information collected during this first pass 520.

Once all document assets have been parsed and processed in first pass 520, conversion process 500 proceeds from step 525 to second pass 530. Here too, each asset (text, font and image assets in this embodiment) is parsed sequentially until no such assets remain, as determined in step 535, at which point the web page elements will be stored on the host server at step 580 for subsequent delivery to and rendering on the client's web browser, as discussed in greater detail below.

Otherwise, each asset is identified in step 545 as a text, font or image asset. The parsing of each media asset during second pass 530 will now be discussed. For each “text” asset, word and character spacing information is extracted in step 550 (utilizing the asset statistics generated during first pass 520) to determine the positions of each character and word of the text asset. Words are identified, for example, by detecting additional horizontal “space” between characters.
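
The word-detection idea in step 550 can be illustrated with a minimal Python sketch. The Glyph record, the split_words function and the gap_threshold parameter are hypothetical names introduced here for illustration; the source does not specify how the spacing threshold is chosen.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # the character drawn at this position
    x: float      # left edge of the character (absolute page coordinate)
    width: float  # horizontal extent of the character


def split_words(glyphs, gap_threshold):
    """Group left-to-right glyphs of one line into words.

    A new word starts whenever the horizontal gap between the end of the
    previous glyph and the start of the next exceeds gap_threshold
    (an assumed tuning parameter).
    """
    words, current, prev_end = [], [], None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end > gap_threshold:
            words.append("".join(current))
            current = []
        current.append(g.char)
        prev_end = g.x + g.width
    if current:
        words.append("".join(current))
    return words
```

In practice the threshold would likely be derived from the per-font statistics gathered in the first pass rather than fixed, but that choice is outside what the text describes.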

One embodiment of a paragraphization algorithm is employed, in step 552, to extract “high-level” information regarding text assets, such as lines and paragraphs. The location/position information extracted in first pass 520, including character and word spacing information (from step 550), is utilized to determine where lines and paragraphs begin and end. Various algorithms can be employed to resolve this basic problem, i.e., identifying lines and paragraphs given “absolute location” information (e.g., spatial coordinates of characters and words employed by document formats such as PDF), and generating “relative location” information via line break, paragraph and other tags employed by the HTML 5 format.

In step 552, paragraph delimiters are identified to distinguish distinct paragraphs from one another. A typical paragraph “pattern” might consist of an indented first line. Where most “lines” have similar “x coordinates,” a recurring higher “x coordinate” indicates an indented line. Similarly, an occasional doubled “y coordinate” differential indicates another common paragraph “pattern” with a blank line delimiting paragraphs.
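
The two paragraph “patterns” just described can be sketched as follows. This is an illustrative Python sketch, not the specific algorithm of step 552; the function name split_paragraphs, the tuple layout of each line, and the indent_tol and gap_factor parameters are assumptions, as is the convention that “y coordinates” increase down the page.

```python
from statistics import median


def split_paragraphs(lines, indent_tol=2.0, gap_factor=1.5):
    """Split lines into paragraphs using indentation and vertical gaps.

    lines: list of (x_start, y, text) tuples in reading order, with y
    assumed to increase down the page.
    """
    if not lines:
        return []
    left_margin = min(x for x, _, _ in lines)
    gaps = [b[1] - a[1] for a, b in zip(lines, lines[1:])]
    typical_gap = median(gaps) if gaps else 0.0

    paragraphs = [[lines[0][2]]]
    for prev, cur in zip(lines, lines[1:]):
        # Pattern 1: the line starts noticeably right of the common margin.
        indented = cur[0] - left_margin > indent_tol
        # Pattern 2: the vertical step is much larger than the usual one,
        # suggesting a blank line between paragraphs.
        blank_line = typical_gap > 0 and (cur[1] - prev[1]) > gap_factor * typical_gap
        if indented or blank_line:
            paragraphs.append([cur[2]])
        else:
            paragraphs[-1].append(cur[2])
    return paragraphs
```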

In addition to detecting delimiters to identify distinct paragraphs, paragraph “justifications” (e.g., left, center and right justifications) are also identified in step 552. For example, consistent “x coordinates” at the beginning (but not the end) of each line of a paragraph indicate a “left-justified” paragraph. Conversely, a “right-justified” paragraph exhibits consistent “x coordinates” at the end (but not the beginning) of each line of the paragraph. Finally, a consistent “x coordinate” differential between the beginning and end of each line of the paragraph indicates a “center-justified” paragraph.
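
A rough Python sketch of this justification test follows. The function name classify_justification and the tol parameter are hypothetical. The center case is interpreted here as lines sharing a common midpoint, which is one reading of the “differential” language above and should be treated as an assumption; the “full” result for lines aligned at both edges is likewise an addition for completeness.

```python
def _spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


def classify_justification(lines, tol=1.0):
    """Classify a paragraph from the (x_start, x_end) pair of each line."""
    starts = [s for s, _ in lines]
    ends = [e for _, e in lines]
    mids = [(s + e) / 2 for s, e in lines]

    left_aligned = _spread(starts) <= tol
    right_aligned = _spread(ends) <= tol
    if left_aligned and right_aligned:
        return "full"
    if left_aligned:
        return "left"
    if right_aligned:
        return "right"
    if _spread(mids) <= tol:
        return "center"
    return "unknown"
```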

The line spacing within (as well as between) paragraphs is discerned from “y coordinate” information, which is converted into appropriate HTML tags in step 554 to generate the appropriate line spacing. Lines and paragraphs detected in step 552 are also converted into HTML 5 (and CSS 3) in step 554 using respective line break (“<br>”) and paragraph (“<p>”) tags, among other text and layout-related attributes (such as the text-indent CSS property). In other embodiments, additional line and paragraph attributes can be detected, and additional HTML tags can be employed.
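
As a simple illustration of step 554, the following Python sketch emits “<p>” and “<br>” markup from already-detected paragraphs. The function name paragraphs_to_html and the indent_px parameter are hypothetical, and the inline style attribute is just one assumed way of applying the text-indent property.

```python
from html import escape


def paragraphs_to_html(paragraphs, indent_px=0):
    """Render paragraphs (each a list of line strings) as HTML.

    Lines within a paragraph are joined with <br>; each paragraph is
    wrapped in <p>, optionally with a first-line indent.
    """
    parts = []
    for lines in paragraphs:
        body = "<br>".join(escape(line) for line in lines)
        style = f' style="text-indent: {indent_px}px"' if indent_px else ""
        parts.append(f"<p{style}>{body}</p>")
    return "\n".join(parts)
```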

Having extracted the high-level line and paragraph information with respect to the text asset in step 552, and converted this “absolute location” information in step 554 into the “relative location” attributes of the HTML 5 and CSS 3 formats, control is returned to step 535 to determine whether any assets remain to be processed. If not, the converted document elements (along with existing non-document elements on the web page) are stored on the host server in step 580, awaiting access during runtime.

Otherwise, if a “font” asset is identified, the glyphs (i.e., “images” of the characters of the font) are extracted in step 560. As noted above, in one embodiment, only those glyphs that actually appear in the document are extracted (to save resources, such as memory and network bandwidth).

These glyphs are mapped in a font file to the Unicode representations of the characters they represent. To access the font file from an HTML 5 web page, an @font-face CSS declaration is employed in the page style block for the font. This creates a custom font definition that can be used by a web browser as if the font were one of the browser's built-in fonts.

In step 562, various geometric transforms are computed, if necessary, for specially formatted text. For example, if diagonal text is employed, each of the characters used in the document is converted, in one embodiment, to a “rotated glyph” (using a simple geometric transform) and stored in a font file as a character of the custom font, mapped to its corresponding Unicode representation. In this embodiment, the vertical positions of each character are also stored in the font file (mapped to the rotated glyphs and their Unicode representations), reflecting the increasing or decreasing slope of successive characters. In other embodiments, information relating to the slope of the diagonal (and even to the rotation of each individual character) can be maintained independently of the individual characters themselves.
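
The “simple geometric transform” for diagonal text can be sketched in Python as a standard 2D rotation of glyph outline points, together with the per-character vertical offsets along a sloped baseline. The function names rotate_outline and baseline_offsets, and the representation of a glyph as a list of (x, y) points, are assumptions made for illustration only.

```python
import math


def rotate_outline(points, angle_deg):
    """Rotate glyph outline points counterclockwise about the origin."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def baseline_offsets(advances, angle_deg):
    """Vertical offset of each successive character on a sloped baseline.

    advances: horizontal distance from each character to the next.
    """
    slope = math.tan(math.radians(angle_deg))
    offsets, x = [], 0.0
    for advance in advances:
        offsets.append(x * slope)
        x += advance
    return offsets
```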

Diagonal text can be detected directly from within a PDF document by virtue of PDF support for rotated text. The presence of diagonal text may also be inferred from the absolute position data (e.g., periodically increasing or decreasing vertical coordinates of adjacent text characters) discerned from the document.
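
The inference from position data can be illustrated with a short Python sketch that looks for a steady rise or fall in the vertical coordinates of adjacent characters. The function name detect_diagonal and the min_step threshold are hypothetical; a fuller detector would presumably also tolerate noise and check that the step size is roughly constant.

```python
def detect_diagonal(y_coords, min_step=0.5):
    """Return "increasing", "decreasing" or None for a run of characters.

    y_coords: vertical coordinate of each adjacent character, in order.
    min_step: smallest per-character change treated as a real slope
    (an assumed tuning parameter).
    """
    if len(y_coords) < 3:
        return None
    steps = [b - a for a, b in zip(y_coords, y_coords[1:])]
    if all(step >= min_step for step in steps):
        return "increasing"
    if all(step <= -min_step for step in steps):
        return "decreasing"
    return None
```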

For other transforms, analogous adjustments are employed (in one embodiment, on a character-by-character basis). Apart from the information stored in the font file, accessible via the @font-face declaration, related attributes can be encoded natively in the HTML 5 web page, such as character spacing, line-height, paragraphs, justification, etc.

Before converting (in step 564) these transformed sets of characters into the appropriate web-readable formats, the characters can, in one embodiment, optionally be encrypted, in step 563 (as a form of HTML 5-compliant “digital rights management” or DRM), to prevent users from copying and pasting the “protected” text into other environments. Unlike the convoluted and easily circumvented methods currently employed to prevent the copying and pasting of text from within web pages (e.g., often relying upon custom JavaScript), this solution leverages the @font-face mechanism built into HTML 5 to map individual characters to alternative characters (e.g., a “tilde”) that can be displayed in their place when a user attempts a copy and paste operation. In other words, rather than attempting to inhibit the copy and paste operation, it is allowed to proceed, but with substituted “encrypted” versions of the actual characters.

Each glyph will still appear in the user's web browser as intended. But it will also be mapped (on the host server, in one embodiment) to an alternative “gibberish” character (e.g., a tilde), that in turn will be mapped to the actual Unicode character itself (e.g., the letter “a”). Thus, the actual Unicode character will remain available, for example, if the user desires to conduct a text search. But, if the user attempts to copy and paste a block of text, the alternative characters will be substituted and, upon being pasted, will show up as “gibberish” characters (thus preventing the unauthorized transfer of such text to other environments).

It should be noted that, for maximum security, the mapping of the characters is confined to the host server, which can be invoked to generate the alternative characters when the user attempts to copy and paste the “encrypted” text (e.g., using a simple JavaScript call in the source web page). In other embodiments, the mapping information can be contained within the files delivered to the user's web browser (avoiding the need to invoke the host for this purpose), though potentially compromising security in the event a third party is able to discern or disable this mapping process.
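
The character-substitution idea can be sketched in Python as a pair of lookup tables. This sketch assigns each distinct character a substitute code point from the Unicode Private Use Area (starting at U+E000) rather than a tilde; that choice, along with the names build_obfuscation and decode, is an assumption for illustration and not the mapping scheme of the described embodiment.

```python
def build_obfuscation(text, first_code=0xE000):
    """Replace each non-space character with a substitute code point.

    Returns (encoded, cmap): encoded is the substituted text that would be
    placed in the page, and cmap maps each substitute character back to the
    real character whose glyph the custom font would draw for it.
    """
    mapping = {}
    for ch in text:
        if ch not in mapping and not ch.isspace():
            mapping[ch] = chr(first_code + len(mapping))
    encoded = "".join(mapping.get(ch, ch) for ch in text)
    cmap = {substitute: real for real, substitute in mapping.items()}
    return encoded, cmap


def decode(encoded, cmap):
    """Recover the real text, as a holder of the mapping table could."""
    return "".join(cmap.get(ch, ch) for ch in encoded)
```

Only a party holding cmap can translate copied text back, which is why keeping the table on the host server, as described above, is the more secure variant.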

As noted above, PDF documents (among others) store fonts in a myriad of different formats (e.g., Type1, Type3, OpenType, etc.), which, to be usable as a web font, must be converted (e.g., into “eot,” “ttf” and “svg” formats, accommodating different positions, encodings, transforms, etc.). To accommodate differences among individual web browsers (including those on embedded devices, such as mobile phones), multiple font files are employed to ensure @font-face support among the differing formats.

For example, in one embodiment, “.eot” formats are utilized for Internet Explorer, “.svg” formats for embedded devices and “.ttf” formats for Firefox, Safari, Chrome, etc. Thus, the @font-face CSS declaration for the “Zapfino” typeface might look like the following:

@font-face { font-family: 'Zapfino'; src: url('Zapfino.eot'); src: url('zapfino/zapfino.svg') format('svg'); src: local('\u263a'), url('Zapfino.otf') format('truetype'); }

Whether or not geometrically transformed and/or optionally encrypted, the glyphs and the corresponding Unicode characters to which they are mapped are then converted, in step 564, into the various web-readable font file formats (“eot,” “ttf” and “svg”), after which control is returned to step 535 to determine whether any assets remain to be processed. If not, the converted document elements (along with existing non-document elements on the web page) are stored on the host server in step 580, awaiting access during runtime.

It should be noted that, in other embodiments, the conversion of fonts into the various web-readable formats in step 564 can be performed at the end of second pass 530 after all text, font and image assets have been parsed (as opposed to converting each font asset as it is parsed).

Finally, if an “image” asset is identified, and the image is a “vector graphic” image, then it is rasterized (i.e., converted into a “bitmap” image) in step 570. In other embodiments, vector graphics can be supported directly in HTML. Then, in step 572, graphic layers are merged. As noted above, the “z order” of multi-layer objects (e.g., bitmaps on text on vector graphics, along with vector fills, gradient patterns, clip polygons, etc.) must be preserved while generating a simpler HTML-friendly structure (e.g., text on background image).

In one embodiment, a boolean bitmap is maintained to facilitate the determination of whether particular page assets (bitmaps, text, vector graphics, etc.) share display space (in which case, for example, clipping is necessary to generate a merged bitmapped image). The boolean bitmap identifies the regions of a page that have currently been “drawn” (processed), and thus which pixels need to be checked for overlap against the current asset being processed.

In one embodiment, two boolean bitmaps are maintained: one for tracking the area currently occupied by the next bitmap (or rasterized vector graphic) being added to the display stack, and the other for tracking the area occupied by text objects. Until there exists overlap between these two boolean bitmaps, the order in which they are drawn makes no difference.
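
The overlap check between the two boolean bitmaps can be sketched in Python with plain nested lists. The helper names make_mask, mark_rect and overlaps are hypothetical, and marking assets as axis-aligned rectangles is a simplifying assumption; real assets could have arbitrary shapes.

```python
def make_mask(width, height):
    """Create an all-False boolean bitmap of the given page size."""
    return [[False] * width for _ in range(height)]


def mark_rect(mask, x, y, w, h):
    """Mark the pixels covered by a rectangular asset as drawn."""
    for row in range(y, y + h):
        for col in range(x, x + w):
            mask[row][col] = True


def overlaps(mask_a, mask_b):
    """True if any pixel is set in both bitmaps (same dimensions assumed)."""
    return any(
        a and b
        for row_a, row_b in zip(mask_a, mask_b)
        for a, b in zip(row_a, row_b)
    )
```

While overlaps returns False for the image and text bitmaps, draw order does not matter; once it returns True, the merge step has to account for which layer is on top.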

In this manner, the two boolean bitmaps are refined in step 572 as each asset is processed, until a “final” background image is generated (taking into account any previously overlapping text) on top of which the “final” text layer is placed. It should be noted that, where white space exists between image assets, the image is split into separate files in step 574. And, in step 576, the image may need to be scaled, converted or otherwise reformatted, depending upon its original format and the size and position information previously extracted. In other embodiments, steps 574 and 576 can (like step 564) be performed at the end of second pass 530 after all text, font and image assets have been parsed (as opposed to splitting files and reformatting each image asset as it is parsed).

Finally, control is returned to step 535 to determine whether any assets remain to be processed. Once all text, font and image assets have been processed, the converted document elements (along with existing non-document elements on the web page) are stored on the host server in step 580, awaiting access during runtime.

When accessed by a client web browser during runtime, the document and non-document elements (including the insertion of ads that may change dynamically) are loaded on the host server in step 585 and delivered to the client web browser, where they are integrated and rendered, in step 590, on the client computer.

B. Automatic Sharing of Reading-Related Activities Across External Social Networks

As alluded to above, in one embodiment of the present invention, users of a website engage in various reading-related activities with respect to documents hosted on the website, regardless of whether such documents have been converted so as to retain the appearance of the original document (as discussed in Section A above). These reading-related activities include reading, annotating, rating or downloading (as well as uploading) documents. Note that, in other embodiments, various other activities could be included and shared, such as the particular page or portion within a document that a user is reading, the number of pages read or even the time spent reading a particular document. Moreover, activities beyond those that are reading-related could be shared with external social networks in a similar fashion to that described herein.

FIG. 6 illustrates an embodiment of an initial “ReadCast” dialog box 610 next to a document 620 displayed on a web page 600. While a user's intent in accessing web page 600 is to read document 620 and engage in various other reading-related activities, this dialog box 610 presents the user with an opportunity to set certain “passive sharing” preferences (not shown) that will result in the automatic sharing of the user's future reading-related activities with desired members of the user's external social networks. For example, after setting these preferences, the user might select a particular document, causing the system to automatically notify the user's Facebook friends (in accordance with the user's specified preferences) that the user has elected to read that particular document. In another embodiment (not shown), whenever a user reads a document, a list of all users who have read the document is displayed next to the document.

One embodiment of these “ReadCast” settings is illustrated in FIG. 7 a, which includes various preference controls 700 covering activities such as “Reading” 702 a document, “Downloading” 704 the document (or sending it to the user's mobile phone via the “Send to Mobile” activity 706), “Rating” 708 the document and “Scribbling” 710 (i.e., annotating the document). In addition, the user specifies, with respect to various social networks 715 (e.g., Facebook 716, Twitter 717 and the Scribd website's own “internal” social network 718), whether each of the activities is shared (by specifying, for each activity, “always” share, “never” share, or “ask” the user at the time of engaging in the activity whether to share such action with the specified social networks).

For example, in FIG. 7 a, the user has enabled all activities 700 and selected the “ask” radio button for each of them (with the exception of the Scribd social network, for which Rating and Scribbling can only be set to always be shared). Thus, when the user reads a particular document (or rates, annotates, downloads or sends the document to the user's mobile phone), the system will automatically ask the user whether to share such information with the user's specified social network (e.g., Facebook friends or Twitter followers).

FIG. 7 b illustrates alternative ReadCast settings. For example, the “Send to Mobile” 706 and “Scribbling” 710 activities have been disabled by the user, and the “Reading” 702 activity is set to “always” be shared on Scribd 718 and “never” be shared on Facebook 716, while the “Rating” 708 activity is set to “always” be shared on Facebook 716 and Twitter 717.

FIGS. 7 a and 7 b also include a “Link to Account” button 720 to enable the user to designate and access their particular Facebook or Twitter account. FIG. 8 illustrates a Twitter dialog box 810 that is invoked when the user selects the “Link to Account” button under the Twitter column. This dialog box 810 provides the user with the opportunity (i.e., an additional layer of security provided by the social networking site) to allow or deny the host website access to the user's Twitter account (e.g., to share the user's designated activities on the Scribd website with the user's Twitter account).

After completing the designation of the desired ReadCast preferences, the user selects the “Save Changes” button 730 (shown in FIGS. 7 a and 7 b), which results (in one embodiment) in the dialog box 910 illustrated in FIG. 9. This dialog box 910 summarizes the user's selected preferences (e.g., indicating the social network(s) on which the user's activities are shared). In other embodiments, the specific activities that are enabled can be displayed.

Once these ReadCast “passive sharing” preference settings have been saved, whenever the user performs one of the designated activities on the host website, a notification indicating that the user has performed that activity will be shared on the user's designated social networks (e.g., Facebook or Twitter, as well as the host Scribd network) without requiring any further action by the user.

In another embodiment, a list of a user's “friends” or other contacts on external social networks is identified and maintained, and ReadCast notifications to anyone on that list are forwarded to the user's Scribd friends, thereby further extending such notifications to a social “network of networks” or a “social Internet.” This is accomplished by using the APIs provided by external social networks (e.g., “Facebook Connect”) to copy and retain a portion of the user's “social graph” or a list of friends. Once the user's social graph is copied to the social network within the host website, specific activities can be shared with that user's social network without further interaction with external social networks or services.

A more detailed description of one embodiment of the passive sharing process is illustrated by the flowchart in FIG. 10. As discussed above, a user initially encounters on the host website (e.g., via dialog box 610 shown in FIG. 6) an opportunity to set initial ReadCast settings, represented by step 1010 in FIG. 10. The system 1000 then displays, in step 1012, the user's default ReadCast settings. The user then sets desired preferences in step 1014, by associating particular activities with specified social networks, as explained above with respect to FIGS. 7 a and 7 b. Upon initially saving those preferences (which, in one embodiment, the user can revise at any time), system 1000 enables, in step 1020, the ReadCast passive sharing behaviors.

As users perform various reading-related activities on the host website, system 1000 detects, in step 1050, a user's performance of one of the predefined actions, and checks, in step 1055, to determine whether that user's ReadCast settings are enabled. If that user's ReadCast settings are not enabled, system 1000 simply permits the user to continue performing the desired reading-related activity (step 1090).

Otherwise, system 1000 identifies, in step 1060, the particular activity being performed by the user and accesses, in step 1062, the user's ReadCast preferences to determine, in step 1065, whether the user's ReadCast settings are enabled for that particular activity. If not, system 1000 (as above) permits the user to continue performing the desired reading-related activity (step 1090).

If the user's ReadCast settings are enabled for that particular activity, then system 1000 identifies, in step 1067, the conditions under which the activity will be “passively shared” with the user's specified social networks. For example, as noted above with respect to FIGS. 7 a and 7 b, the user may have enabled that activity to always be shared with certain social networks and never be shared with others (and perhaps to be asked at the time whether to share the activity with certain other social networks). Of course, in other embodiments, additional options and conditions could be specified.

Finally, to the extent a particular activity (e.g., reading a particular article on the host website) has been designated to be shared with one or more of the user's social networks, then system 1000 proceeds, in step 1069, to initiate the “passive sharing” of that activity, e.g., to notify one or more of the user's designated social networks that the user has engaged in that particular activity. System 1000 (as above) then permits the user to continue performing the desired reading-related activity (step 1090).
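
The decision flow of FIG. 10 (steps 1055 through 1069) can be summarized in a short Python sketch. The settings dictionary layout, the function name networks_to_notify and the ask_user callback are hypothetical structures chosen for illustration; the source does not specify how preferences are stored or how the “ask” prompt is implemented.

```python
def networks_to_notify(settings, activity, ask_user=lambda network: False):
    """Return the social networks that should be notified of an activity.

    settings: {"enabled": bool,
               "activities": {activity: {network: "always" | "never" | "ask"}}}
    ask_user: callback invoked for "ask" entries; returns True to share.
    """
    # Step 1055: are the user's ReadCast settings enabled at all?
    if not settings.get("enabled"):
        return []
    # Steps 1060-1065: is sharing configured for this particular activity?
    prefs = settings.get("activities", {}).get(activity)
    if not prefs:
        return []
    # Step 1067: evaluate the sharing condition for each network.
    targets = []
    for network, mode in prefs.items():
        if mode == "always" or (mode == "ask" and ask_user(network)):
            targets.append(network)
    # Step 1069 would send a notification to each returned network.
    return targets
```

In every branch the user simply continues the reading-related activity (step 1090); the function only decides whether any notifications go out.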

It should be emphasized that various modifications and combinations of the above-described embodiments can be employed without departing from the spirit of the present invention.

1. A method for converting and integrating non-HTML documents into HTML web pages on a host server while preserving the original appearance and text searchability of the documents, the method including the following steps: (a) parsing a document to extract text characters and associated fonts, as well as page layout attributes of the document, each glyph in a font representing the appearance of its associated text character with respect to that font; (b) integrating the text characters into an HTML web page and generating HTML tags to preserve the document's page layout attributes; (c) generating one or more font files, accessible from the HTML web page, that map the text characters to their associated glyphs; and (d) storing the HTML web page and font files on the host server for delivery to and rendering within the window of a client web browser, whereby the original appearance and text searchability of the document is preserved.
2. The method of claim 1 wherein the CSS 3 @font-face tag is employed to link the font files to the HTML web page.
3. The method of claim 1 wherein the HTML web page contains a plurality of web page elements external to the document, and wherein the document and the plurality of web page elements can be displayed within the client's web browser window.
4. The method of claim 1 wherein a user of the client's web browser can select and search for text within the document using the client web browser's standard controls.
5. The method of claim 1 wherein a user of the client's web browser can zoom text within the document and scroll among the pages of the document using the client web browser's standard controls.
6. The method of claim 3 wherein the plurality of web page elements include an advertisement external to the document.
7. The method of claim 6 wherein the advertisement is located to the side of a page of the document, whereby the ad inventory of the web page is proportional to the number of pages of the document.
8. The method of claim 1 wherein the font files include, for each text character, a mismatched character code that does not correspond to the character's associated glyph, and wherein the HTML web page contains the mismatched character codes and instructions directing the web browser to use the font files for displaying the glyphs, whereby the web browser utilizes the font files to display the text characters correctly, but cannot search for or copy the text characters due to the mismatched character codes in the HTML web page.
9. The method of claim 1 wherein the page layout attributes of at least some portion of the document are specified in the HTML web page by the organization of the text characters into words, lines and paragraphs, and wherein the page layout attributes are preserved by: (a) extracting from the document absolute position information relating to the text characters; (b) analyzing the absolute position information to identify relative position information, including the beginning and end of individual words, lines of text and paragraphs of text; and (c) generating HTML tags, from the relative position information, to delineate the beginning and end of individual lines of text and paragraphs of text.
10. The method of claim 1 wherein the page layout attributes of the document include diagonal text, and wherein the page layout attributes are preserved by: (a) detecting the presence of diagonal text while parsing the document; (b) generating, via a geometric transformation, a rotated glyph corresponding to each text character of the diagonal text; and (c) mapping, to each rotated glyph, vertical position information that enables the client's web browser to render the diagonal text.
11. The method of claim 10 wherein the presence of diagonal text is detected by extracting from the document absolute position information relating to the text characters, and identifying periodically increasing or decreasing vertical offsets of adjacent text characters.
12. A method for displaying text in a web page using the built-in functionality of a web browser, while inhibiting the use of that functionality to search for and copy the text, the method including the following steps: (a) generating a font file containing, for each text character, a corresponding glyph representing the appearance of that character, and a mismatched character code that does not correspond to the glyph; and (b) generating an HTML document that contains the mismatched character codes and instructions directing the web browser to use the font file for displaying the glyphs, (c) whereby the web browser utilizes the font file to display the text correctly, but cannot search for or copy the text due to the mismatched character codes in the HTML document.
13. A system that converts and integrates non-HTML documents into HTML web pages on a host server while preserving the original appearance and text searchability of the documents, the system comprising: (a) a document parser that extracts text characters and associated fonts, as well as page layout attributes of the document, each glyph in a font representing the appearance of its associated text character with respect to that font; (b) an HTML converter that integrates the text characters into an HTML web page and generates HTML tags to preserve the document's page layout attributes; (c) a font file generator that generates one or more font files, accessible from the HTML web page, that map the text characters to their associated glyphs; and (d) a website host on the host server that stores the HTML web page and font files for delivery to and rendering within the window of a client web browser, whereby the original appearance and text searchability of the document is preserved.
14. The system of claim 13 wherein the CSS 3 @font-face tag is employed to link the font files to the HTML web page.
15. The system of claim 13 wherein the HTML web page contains a plurality of web page elements external to the document, and wherein the document and the plurality of web page elements can be displayed within the client's web browser window.
16. The system of claim 13 wherein a user of the client's web browser can select and search for text within the document using the client web browser's standard controls.
17. The system of claim 13 wherein a user of the client's web browser can zoom text within the document and scroll among the pages of the document using the client web browser's standard controls.
18. The system of claim 15 wherein the plurality of web page elements include an advertisement external to the document.
19. The system of claim 18 wherein the advertisement is located to the side of a page of the document, whereby the ad inventory of the web page is proportional to the number of pages of the document.